$LCSk$++: Practical similarity metric for long strings

Authors

Filip Pavetic

Goran Zuzic

Mile Sikic

Abstract

In this paper we present LCSk++, a new metric for measuring the similarity of long strings, and provide an algorithm for its efficient computation. With the ever-increasing size of strings occurring in practice, e.g. large genomes of plants and animals, classic algorithms such as Longest Common Subsequence (LCS) fail due to their demanding computational complexity. Recently, Benson et al. defined a similarity metric named LCSk. By relaxing the requirement that the k-length substrings should not overlap, we extend their definition into a new metric. An efficient algorithm is presented which computes LCSk++ with complexity of $O((|X| + |Y|) \log(|X| + |Y|))$ for strings X and Y under a realistic random model. The algorithm has been designed with implementation simplicity in mind. Additionally, we describe how it can be adjusted to compute LCSk as well, which gives an improvement over the $O(|X| \cdot |Y|)$ algorithm presented in the original LCSk paper.
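To make the metric concrete, here is a minimal Python sketch. It is not the near-linear algorithm the abstract describes. It is a plain dynamic program, and it assumes LCSk++ is the maximum total length of an ordered chain of common substrings of X and Y, each of length at least k. The function name `lcsk_plus_plus` and the table names are hypothetical, chosen only for illustration.

```python
def lcsk_plus_plus(x: str, y: str, k: int) -> int:
    """Naive reference DP (assumed definition, not the paper's algorithm).

    Returns the largest total length of common substrings of x and y that
    appear in the same order in both strings, each of length >= k (k >= 1).
    """
    n, m = len(x), len(y)
    # suf[i][j]: length of the longest common suffix of x[:i] and y[:j]
    suf = [[0] * (m + 1) for _ in range(n + 1)]
    # dp[i][j]: best score using only the prefixes x[:i] and y[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                suf[i][j] = suf[i - 1][j - 1] + 1
            # Option 1: skip the last character of x or of y.
            best = max(dp[i - 1][j], dp[i][j - 1])
            # Option 2: end a common block of length l >= k exactly at (i, j).
            for l in range(k, suf[i][j] + 1):
                best = max(best, dp[i - l][j - l] + l)
            dp[i][j] = best
    return dp[n][m]


print(lcsk_plus_plus("ABCDXYZ", "ABCDYZ", 2))  # 6: blocks "ABCD" and "YZ"
print(lcsk_plus_plus("ABCDXYZ", "ABCDYZ", 3))  # 4: only "ABCD" is long enough
```

With k = 1 this sketch reduces to the classic LCS recurrence. It runs in at least quadratic time, so it is useful only as a correctness reference for short inputs, not for genome-scale strings.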


Similar resources

Fast and simple algorithms for computing both $LCS_{k}$ and $LCS_{k+}$

Longest Common Subsequence (LCS) deals with the problem of measuring similarity of two strings. While this problem has been analyzed for decades, the recent interest stems from a practical observation that considering single characters is often too simplistic. Therefore, recent works introduce the variants of LCS based on shared substrings of length exactly or at least k (LCSk and LCSk+ respect...


PASS-JOIN: A Partition-based Method for Similarity Joins

As an essential operation in data cleaning, the similarity join has attracted considerable attention from the database community. In this paper, we study string similarity joins with edit-distance constraints, which find similar string pairs from two large sets of strings whose edit distance is within a given threshold. Existing algorithms are efficient either for short strings or for long stri...


EmbedJoin: Efficient Edit Similarity Joins via Embeddings

We study the problem of edit similarity joins, where given a set of strings and a threshold value K, we want to output all pairs of strings whose edit distances are at most K. Edit similarity join is a fundamental problem in data cleaning/integration, bioinformatics, collaborative filtering and natural language processing, and has been identified as a primitive operator for database systems. i...


Efficiently Supporting Edit Distance Based String Similarity Search Using B$^+$-Trees

Edit distance is widely used for measuring the similarity between two strings. As a primitive operation, edit distance based string similarity search is to find strings in a collection that are similar to a given query string using edit distance. Existing approaches for answering such string similarity queries follow the filter-and-verify framework by using various indexes. Typically, most appr...


The Analytic Technique and Experimental Research Methods of Post-buckling about Slender Rod Strings in Wellbore

The buckling behavior of rod strings in wellbore is one of the key issues in petroleum engineering. The slender rod strings in vertical wellbore were selected as research objects. Based on the energy method, the critical load formulas of sinusoidal and helical buckling were derived for the string with the bottom of the wellbore pressure. According to the sinusoidal and helical buckling’s geomet...



Journal:

CoRR

Volume abs/1407.2407

Publication date: 2014
